State of the Union Analyses¶
Background¶
The State of the Union (SotU) is an annual message delivered by the president of the United States to Congress. It is a chance for the president to speak to the nation and make a case for their agenda and administration. The speech lays out the priorities of the president and their administration, as well as the overall health of the nation, especially its economic health. Major issues of the day are addressed.
For this project we have the transcribed speeches of each SotU between 1945 and 2024. The time period covers 14 different presidents, 7 Democrats and 7 Republicans. The raw text for each speech can be found here.
Questions¶
The Constitution directs the president to give Congress information on the state of the union "from time to time," and in modern practice this has become an annual address. Given that the SotU is perhaps the most important political speech given each year, we have a wealth of data to dive into. Using NLP algorithms, we can break down and analyze each speech in great detail to detect patterns and trends. This project will focus on answering the following questions:
- Are there variations in tone or sentiment between presidents?
- What could account for changes in sentiment in each SotU? Are there indicators which could predict sentiment?
- How does the language used in these speeches change over time, and are there noticeable patterns?
- Are there major differences between the speeches given by presidents of the two political parties?
NLP Basics¶
Natural language processing (NLP) is a field of machine learning that focuses on the interaction between human language and computers. In particular, NLP is concerned with analyzing large amounts of natural language data to help us better understand and interpret human language. One of the most important tools in any NLP package is a process known as tokenization. Tokenization breaks down a sentence, speech, book, etc. into individual words, referred to as tokens. Each token can then be analyzed and manipulated: a token can be assigned a part of speech, a sentiment, or a subjectivity score, be made singular or plural, and so on.
Another vital NLP concept is that of stop words. Stop words are commonly used words (e.g. the, and, is) that a program has been taught to ignore. If you are looking for the most common words in a book, these words would show up as the top results every time and skew many analyses, so they are often left out.
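To make tokenization and stop-word filtering concrete, here is a minimal, self-contained sketch. It uses a tiny hand-rolled stop list purely for illustration; the real lists shipped with NLTK and spaCy are far longer.

```python
import re
from collections import Counter

# Tiny illustrative stop list; real NLP packages ship hundreds of entries.
STOP_WORDS = {"the", "and", "is", "a", "of", "to", "in"}

def tokenize(text):
    # Lowercase and split on runs of non-letter characters to produce tokens.
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def top_words(text, n=3):
    # Count tokens, skipping stop words, and return the n most common.
    counts = Counter(t for t in tokenize(text) if t not in STOP_WORDS)
    return counts.most_common(n)

sample = "The state of the union is strong, and the union endures."
print(tokenize(sample)[:4])  # → ['the', 'state', 'of', 'the']
print(top_words(sample))     # → [('union', 2), ('state', 1), ('strong', 1)]
```

Note how "the" dominates the raw token stream but disappears from the frequency counts once the stop list is applied, which is exactly why stop-word removal matters for the analyses below.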
import os
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.corpus import stopwords
import spacy
from textblob import TextBlob
%matplotlib inline
import warnings
# suppress noisy future/runtime warnings from the plotting libraries
# (simplefilter takes a single category per call, not a list)
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)
# import all the SOTU txt files
data_folder = 'state-of-the-union-corpus-1989-2024'
file_list = [f for f in os.listdir(data_folder) if not f.startswith(".")]
# create lists for the year and president who gave the speech
years = []
potus = []
for file in file_list:
year = file[:4]
pres = file[5:-4]
pres = re.sub('[^a-zA-Z]+', '', pres)
years.append(year)
potus.append(pres)
# create dataframe
df = pd.DataFrame(file_list, columns=['path'])
df['year'] = years
# print the first 5 rows of the dataframe
print(df.head())
df['year'] = df['year'].astype(int)
df['president'] = potus
# add state-of-the-union-corpus-1989-2024 to the beginning of each path
df['path'] = data_folder + '/' + df['path']
# create a list of Democrat presidents
dems = ['Truman', 'Kennedy', 'Johnson', 'Carter', 'Clinton', 'Obama', 'Biden']
# add party affiliation to the dataframe
df['party'] = df.president.apply(lambda x: 'Democrat' if x in dems else 'Republican')
                  path  year
0        1990-Bush.txt  1990
1       2022-Biden.txt  2022
2  1957-Eisenhower.txt  1957
3      1988-Reagan.txt  1988
4      1946-Truman.txt  1946
nltk.download("punkt")
# helper to read a speech file (so each file handle is closed promptly)
def read_speech(path):
    with open(path, encoding='ISO-8859-1') as f:
        return f.read()

# use TextBlob to tokenize the words in each speech
df['tokens'] = df.path.apply(lambda x: TextBlob(read_speech(x)).words)
# use TextBlob to perform a sentiment analysis of each speech
df['sentiment'] = df.path.apply(lambda x: TextBlob(read_speech(x)).sentiment.polarity)
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/zachberman/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
# import GDP growth and unemployment data and merge with main dataframe
economic_df = pd.read_csv('gdp_data.csv')
# convert 12/31/2017 to year
economic_df['year'] = economic_df['Date'].apply(lambda x: x[6:])
economic_df['year'] = economic_df['year'].astype(int)
# rename GDP Growth Rate to gdp_rate
economic_df.rename(columns={'GDP Growth Rate':'gdp_rate'}, inplace=True)
# drop date
economic_df.drop('Date', axis=1, inplace=True)
df = df.merge(economic_df, how="left", on="year")
unemployment_df = pd.read_csv('unemployment_data.csv')
# rename Rate to unemployment_rate, Year to year
unemployment_df.rename(columns={'Rate':'unemployment_rate', 'Year':'year'}, inplace=True)
df = df.merge(unemployment_df, how='left', on='year')
df['unemployment_rate'] = df['unemployment_rate']*100
df['gdp_rate'] = df['gdp_rate']*100
# re index the dataframe by year
df = df.sort_values(by="year").reset_index(drop=True)
print(df.head())
path year president \
0 state-of-the-union-corpus-1989-2024/1945-Truma... 1945 Truman
1 state-of-the-union-corpus-1989-2024/1946-Truma... 1946 Truman
2 state-of-the-union-corpus-1989-2024/1947-Truma... 1947 Truman
3 state-of-the-union-corpus-1989-2024/1948-Truma... 1948 Truman
4 state-of-the-union-corpus-1989-2024/1949-Truma... 1949 Truman
party tokens sentiment \
0 Democrat [PRESIDENT, HARRY, S, TRUMAN, 'S, ADDRESS, BEF... 0.103764
1 Democrat [PRESIDENT, HARRY, S, TRUMAN, 'S, MESSAGE, TO,... 0.111545
2 Democrat [PRESIDENT, HARRY, S, TRUMAN, 'S, ANNUAL, MESS... 0.136731
3 Democrat [PRESIDENT, HARRY, S, TRUMAN, 'S, ANNUAL, MESS... 0.165403
4 Democrat [PRESIDENT, HARRY, S, TRUMAN, 'S, ANNUAL, MESS... 0.155075
gdp_rate unemployment_rate
0 11.45 1.9
1 10.67 3.9
2 -0.01 3.9
3 3.80 4.0
4 -1.50 6.6
#check data types and display dataframe
print(df.shape)
print(df.dtypes)
df.head()
(81, 8)
path                  object
year                   int64
president             object
party                 object
tokens                object
sentiment            float64
gdp_rate             float64
unemployment_rate    float64
dtype: object
| | path | year | president | party | tokens | sentiment | gdp_rate | unemployment_rate |
|---|---|---|---|---|---|---|---|---|
| 0 | state-of-the-union-corpus-1989-2024/1945-Truma... | 1945 | Truman | Democrat | [PRESIDENT, HARRY, S, TRUMAN, 'S, ADDRESS, BEF... | 0.103764 | 11.45 | 1.9 |
| 1 | state-of-the-union-corpus-1989-2024/1946-Truma... | 1946 | Truman | Democrat | [PRESIDENT, HARRY, S, TRUMAN, 'S, MESSAGE, TO,... | 0.111545 | 10.67 | 3.9 |
| 2 | state-of-the-union-corpus-1989-2024/1947-Truma... | 1947 | Truman | Democrat | [PRESIDENT, HARRY, S, TRUMAN, 'S, ANNUAL, MESS... | 0.136731 | -0.01 | 3.9 |
| 3 | state-of-the-union-corpus-1989-2024/1948-Truma... | 1948 | Truman | Democrat | [PRESIDENT, HARRY, S, TRUMAN, 'S, ANNUAL, MESS... | 0.165403 | 3.80 | 4.0 |
| 4 | state-of-the-union-corpus-1989-2024/1949-Truma... | 1949 | Truman | Democrat | [PRESIDENT, HARRY, S, TRUMAN, 'S, ANNUAL, MESS... | 0.155075 | -1.50 | 6.6 |
Sentiment Analysis¶
A sentiment analysis examines each token in the language data and assigns it a rating based on how positive or negative the connotation of that token is. In this instance, the scale runs from -1 to 1, with -1 being the most negative words possible and 1 being words with a very positive meaning. For example, the word 'great' would have a higher score than 'good', and the word 'awful' would have a lower score than the word 'bad'. Words with no positive or negative leaning are given a score of zero and are not factored into the analysis. The TextBlob package has a built-in algorithm with a long list of words already scored, and I will be utilizing that feature.
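TextBlob's lexicon covers thousands of scored words; as a toy illustration of how this kind of lexicon-based polarity scoring works, here is a minimal scorer. The lexicon entries and scores below are invented for illustration, not TextBlob's actual values.

```python
# Toy polarity lexicon; these scores are invented for illustration,
# not TextBlob's actual values.
LEXICON = {"great": 0.8, "good": 0.5, "bad": -0.5, "awful": -0.8}

def polarity(text):
    # Average the scores of known words; words absent from the
    # lexicon are treated as neutral and skipped entirely.
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(polarity("a great year"))                                # → 0.8
print(polarity("the economy is good but the deficit is bad"))  # → 0.0
```

The second sentence illustrates how positive and negative words can cancel out to a neutral overall polarity, which is why whole-speech scores cluster in a narrow band rather than hitting the extremes of the scale.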
Below, you can see the sentiment analysis grouped by president. Every SotU has had an overall positive sentiment, ranging from .08 to .20. The SotU must strike a careful balance. On one hand, the SotU is a reflection of an administration and the health of the nation, and therefore the president would not want to be overly negative, as people would assume the administration is doing a poor job. On the other hand, if the SotU is too optimistic or positive, an administration could come off as out of touch with the suffering of certain sectors of the populace, alienating them. The strategy seems to be to find a balance of what I will call 'cautious optimism': things are going well, but there is always room for improvement.
How these sentiments changed throughout a presidency is also worth noting. Of the 14 presidents included here, 8 had their sentiment scores increase over time. Of the Democrats, only Kennedy and Biden showed a negative trend in sentiment, and both trends were very minor. Conversely, the Republican presidents were split evenly, half trending positive and half negative.
# plot sentiment analysis data
# Create the plot with sorted hue
g = sns.lmplot(
x="year", y="sentiment", hue="president", aspect=2.5, truncate=True, data=df
)
g._legend.set_title("President")
plt.xlabel("Year")
plt.ylabel("Sentiment")
plt.title("SotU Sentiment Analysis by President")
plt.show()
Economic Factors¶
Each presidency has to deal with a unique set of issues and challenges, but one constant is the state of the economy. I wanted to take a surface-level look at some important economic indicators to see if there was any correlation with changing sentiment. Two of the major indicators for the economy are the GDP growth rate and the unemployment rate. Below I have created identically styled graphs for visual comparison. A visual analysis suggests there isn't much correlation between GDP and sentiment; if there were, you would expect to see sentiment increasing as GDP rates increased. However, there does appear to be an inverse correlation between unemployment and sentiment: as unemployment rates decrease, sentiment appears to increase. Of course, we can do more than just visually inspect this data.
In order to run a correlation analysis, we have to check whether our features have normal distributions. All three show distributions normal enough to run a Pearson correlation analysis. To visualize the relationships, scatterplots have been created. You can see that there is almost no trend between sentiment and GDP rates, while there is an inverse trend between unemployment and sentiment. The Pearson correlation backs up these trends: the relationship between unemployment and sentiment has an r-value of -0.27 and a p-value of 0.016. We need to be wary of making any causal claims; this is just a correlation, and p-values from Pearson tests are less reliable with samples this small. More economic data would help establish an actual relationship with a regression. Still, if unemployment is going up, we would expect the SotU to be less optimistic, as more voters are out of work and probably not very happy.
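As a sanity check on what the r-value measures, Pearson's r can be computed by hand as the covariance scaled by the product of the standard deviations. This is a stdlib-only sketch of the statistic that `scipy.stats.pearsonr` reports (minus the p-value):

```python
import math

def pearson_r(xs, ys):
    # Pearson correlation: covariance divided by the product of the
    # (unnormalized) standard deviations; the n factors cancel out.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly inverse series give r = -1, the strongest possible
# version of the unemployment/sentiment relationship seen above.
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # → -1.0
```

An r of -0.27, by contrast, indicates a weak inverse relationship: the trend is real in this sample but unemployment explains only a small share of the variation in sentiment.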
#plot GDP growth data
g1 = sns.lmplot(x="year", y="gdp_rate", hue="president", aspect=2.5, truncate=True, data=df)
g1._legend.set_title('President')
plt.xlabel('Year')
plt.ylabel('% GDP Growth Rate')
plt.title('GDP Growth Rate by Year')
plt.show()
#plot unemployment rate data
g2 = sns.lmplot(x="year", y="unemployment_rate", hue="president", aspect=2.5, truncate=True, data=df)
g2._legend.set_title('President')
plt.xlabel('Year')
plt.ylabel('% Unemployment Rate')
plt.title('Unemployment Rate by Year')
plt.show()
plt.figure(figsize=(16, 5))
plt.subplot(131)
plt.hist(df.sentiment)
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.subplot(132)
plt.hist(df.gdp_rate)
plt.xlabel('GDP Growth Rate')
plt.ylabel('Count')
plt.subplot(133)
plt.hist(df.unemployment_rate)
plt.xlabel('Unemployment Rate')
plt.ylabel('Count')
plt.show()
plt.figure(figsize=(15, 5))
plt.subplot(121)
sns.regplot(x=df['gdp_rate'], y=df['sentiment'])
plt.xlabel('GDP Growth Rate')
plt.ylabel('Sentiment')
plt.subplot(122)
sns.regplot(x=df['unemployment_rate'], y=df['sentiment'])
plt.xlabel('Unemployment Rate')
plt.ylabel('Sentiment')
plt.show()
from scipy.stats import pearsonr
print('Sentiment-GDP Correlation:', pearsonr(df.sentiment, df.gdp_rate))
print('Sentiment-Unemployment Correlation:', pearsonr(df.sentiment, df.unemployment_rate))
Sentiment-GDP Correlation: PearsonRResult(statistic=np.float64(0.06024595959639101), pvalue=np.float64(0.5931535000622122))
Sentiment-Unemployment Correlation: PearsonRResult(statistic=np.float64(-0.26593866380842063), pvalue=np.float64(0.016411662157805686))
Most Common Words¶
Psychology has shown that the more often a word or phrase is used in a speech, the more it sticks with the audience. As such, the key, defining themes of a speech can often be derived by looking at its most commonly used words. Here, the spaCy package is used to tokenize each speech once again. This analysis could also be done via TextBlob, but spaCy offers a more robust tool set for this particular feature. The speeches are cleaned up and stop words are removed from the analysis. For this project, I am only analyzing the 20 most common words in each speech, but any word could be analyzed here. If you wanted to look at economic terms like jobs, inflation, etc., you could do that as well.
Word clouds were created for each speech to make visualizing the shifts in language more appealing. A write up on the thematic shifts by president, or by era, or inside a presidency itself could probably fill a book. Take the time to scan through and find patterns. See if you can notice when large global events have occurred (wars, recessions, booms).
In this particular project, I am interested in seeing if presidents of different parties have unique ways of speaking or common themes they use over time. As such, clustering models will be utilized.
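Clustering operates on numeric document vectors rather than raw text. Below is a minimal, stdlib-only sketch (with made-up mini-documents) of turning texts into term-frequency vectors and comparing them with cosine similarity, the kind of representation and distance a clustering model over these speeches would consume:

```python
import math
from collections import Counter

def tf_vector(text):
    # Term-frequency vector: token -> count (a sparse bag-of-words).
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Made-up mini-documents standing in for speeches.
d1 = tf_vector("jobs jobs economy growth")
d2 = tf_vector("jobs economy inflation")
d3 = tf_vector("war peace treaty")

# Economically themed documents land closer together than
# an economic and a foreign-policy document.
print(cosine(d1, d2) > cosine(d1, d3))  # → True
```

A full pipeline would typically weight terms with TF-IDF and feed the resulting vectors to something like k-means, but the geometry it relies on is exactly this.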
# utility function to clean text
def text_cleaner(text):
# Replace double dash '--' with a space
text = re.sub(r"--", " ", text)
# Remove headings in brackets
text = re.sub(r"\[.*?\]", "", text)
# Remove extra whitespace
text = " ".join(text.split())
return text
#load spacy model
nlp = spacy.load('en_core_web_sm')
#add features with text for each speech and run through the text cleaner
df['text'] = df.path.apply(lambda x: open(x, encoding='ISO-8859-1').read())
df['text'] = df.text.apply(lambda x: text_cleaner(x))
df['nlp'] = df.text.apply(lambda x: nlp(x))
df.head()
| | path | year | president | party | tokens | sentiment | gdp_rate | unemployment_rate | text | nlp |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | state-of-the-union-corpus-1989-2024/1945-Truma... | 1945 | Truman | Democrat | [PRESIDENT, HARRY, S, TRUMAN, 'S, ADDRESS, BEF... | 0.103764 | 11.45 | 1.9 | PRESIDENT HARRY S. TRUMAN'S ADDRESS BEFORE A J... | (PRESIDENT, HARRY, S., TRUMAN, 'S, ADDRESS, BE... |
| 1 | state-of-the-union-corpus-1989-2024/1946-Truma... | 1946 | Truman | Democrat | [PRESIDENT, HARRY, S, TRUMAN, 'S, MESSAGE, TO,... | 0.111545 | 10.67 | 3.9 | PRESIDENT HARRY S. TRUMAN'S MESSAGE TO THE CON... | (PRESIDENT, HARRY, S., TRUMAN, 'S, MESSAGE, TO... |
| 2 | state-of-the-union-corpus-1989-2024/1947-Truma... | 1947 | Truman | Democrat | [PRESIDENT, HARRY, S, TRUMAN, 'S, ANNUAL, MESS... | 0.136731 | -0.01 | 3.9 | PRESIDENT HARRY S. TRUMAN'S ANNUAL MESSAGE TO ... | (PRESIDENT, HARRY, S., TRUMAN, 'S, ANNUAL, MES... |
| 3 | state-of-the-union-corpus-1989-2024/1948-Truma... | 1948 | Truman | Democrat | [PRESIDENT, HARRY, S, TRUMAN, 'S, ANNUAL, MESS... | 0.165403 | 3.80 | 4.0 | PRESIDENT HARRY S. TRUMAN'S ANNUAL MESSAGE TO ... | (PRESIDENT, HARRY, S., TRUMAN, 'S, ANNUAL, MES... |
| 4 | state-of-the-union-corpus-1989-2024/1949-Truma... | 1949 | Truman | Democrat | [PRESIDENT, HARRY, S, TRUMAN, 'S, ANNUAL, MESS... | 0.155075 | -1.50 | 6.6 | PRESIDENT HARRY S. TRUMAN'S ANNUAL MESSAGE TO ... | (PRESIDENT, HARRY, S., TRUMAN, 'S, ANNUAL, MES... |
from collections import Counter
nltk.download("stopwords")
# save a list of stop words
stop_words = stopwords.words('english')
# utility function that returns a Counter of word frequencies
def word_frequencies(text, include_stop=False):
# Build a list of words.
# Strip out punctuation and, optionally, stop words.
words = []
for token in text:
#exclude punctuation, numeric characters and words in the stop list
if not token.is_punct and token.is_alpha and not token.is_stop and token.lower_ not in stop_words:
words.append(token.text)
# Build and return a Counter object containing word counts.
return Counter(words)
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/zachberman/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
from wordcloud import WordCloud
# Function to create word clouds for each speech
def word_clouds(text_list, df):
num_items = len(text_list)
cols = 2 # Number of columns
rows = (num_items + cols - 1) // cols
plt.figure(figsize=(15, rows * 5))
for n, texts in enumerate(text_list):
plt.subplot(rows, cols, n + 1)
wordcloud = WordCloud(stopwords=stop_words, max_words=20).generate(texts)
plt.imshow(wordcloud, interpolation="bilinear")
plt.title(f"{df.president[n]} {df.year[n]}")
plt.axis("off")
plt.tight_layout()
plt.show()
word_clouds(df.text, df)